INTRODUCTION

The  Macro  Function  Language,  MFL,  is  a  signal  processing 
language  developed  out  of  NADC's  MULTISENSOR  STANDARD  MACRO 
FUNCTION  STUDY  contract  N62269-79-C-0116  performed  by  the 
Submarine  Signal  Division  of  Raytheon  Company.  This  study 
identified  and  characterized  signal  processing  macro  functions 
which  are  common  to  multiple  sensor  areas.  Signal  processing  macro 
functions  are  a  complete  set  of  primitive  functions,  control 
operators  and  array  transformations  for  signal  processing 
operations.  (See  figure  1.)  A  complete  set  of  primitives  provides  all
the  functions  needed  to  solve  signal  processing  problems,  and 
permits  the  solution  to  be  stated  in  a  form  familiar  to  a  signal 
processing  analyst. 

NADC  formalized  this  Macro  Function  Set  into  the  MFL  language 
under  contract  N62269-83-C-0441  performed  by  Raytheon.  The 
development  of  MFL  has  been  an  effort  to  first  look  at  signal 
processing  applications,  then  at  a  primitive  macro  function  set  that 
would  conveniently  express  those  applications,  and  finally  formalize 
those  primitives  into  a  signal  processing  language,  MFL.  This  report 
will  take  the  development  process  a  step  further  by  defining  the 
hardware  that  will  most  efficiently  execute  MFL. 

Traditionally, computer systems were developed by designing the
hardware first, then writing microcode, then assembly language, and
so on up to a common programming language such as Ada or Pascal.
This emphasis on hardware was due to the relatively high cost of
hardware. In the current state of the art, software costs are far
greater than hardware costs over the life of a signal processor.
Because of this, MFL attempts to standardize the software interface
to hardware. This interface will standardize the software between
signal processors designed to run MFL down to the microcode level.
This report will examine this standard interface and its effect on
hardware designs.

This report deals with the hardware considerations that MFL
requires for an efficient implementation, with only a brief
introduction to the language. For a complete description of the
language see NADC report #N62269-83-C-0441.

Figure 1

AN MFL OVERVIEW

Signal processing data is highly structured. The data is
naturally in the form of arrays of vectors and matrices. This is in
contrast to data processing, where the data is random and
unstructured and the emphasis is on procedural control. The
structure of the data must be provided by the language. Exactly as
the math of signal processing applies functions to arrays, MFL
applies functions to arrays. This close tie of MFL to signal
processing math makes the language easy to learn for someone who
is already familiar with the mathematics of signal processing (in
fact, too much procedural programming experience can actually be a
hindrance to learning MFL). MFL performs array manipulation using a
construct called the array transformation. The array
transformation, to be discussed later, can conveniently express any
sequencing or rearrangement of data within arrays (e.g., a
transpose).

In an efficient vector processor, operations to align and select
elements from arrays must occur in parallel with arithmetic
operations. These tasks can complicate the arithmetic hardware, and
the programming of that hardware, if the alignment or shifting must
be done in the arithmetic unit. MFL breaks data manipulation and
math processing into a natural division of tasks. The data
manipulation is done by the MFL array transformation code, which
gives the programmer a flexible addressing mechanism. The math
processing is done by the MFL math primitives. These two types of
MFL code are partitioned graphically as shown in figure 2.


Figure  2 


The four boxes in figure 2 represent the four fields of MFL code.

The boxes marked A, B, and C hold the array transformation code (the
code that selects and aligns elements from data arrays in memory)
and the data descriptors. The box marked "function" holds the math
primitives and control operators. In figure 3 a generic MFL
processor is shown. The code from boxes A, B, and C in figure 2
drives the smart port hardware ports A, B, and C. The math
primitives, the code in the box marked "function" in figure 2, drive
the arithmetic unit in figure 3.

A GENERIC MFL MACHINE


Figure 3

The general format of an array transformation and data
descriptor is:

DATANAME
C16  10;20
{ Δ4  Δ3  Δ2  Δ1 | Δ0 }
{ L4  L3  L2  L1 | B  }

The data name, DATANAME, is the name of the array. The second line
specifies 16 bit complex data in a 10 by 20 array. The third line
gives the displacement directions, delta 1, 2, 3 and 4, each of
which is in the row, column or depth direction, and delta 0, the
starting point for the read. The fourth line gives the length of
each of the displacements, L1, L2, L3 and L4, that is, how far the
read in that direction is to go, and the boundary mode, B (wrap
around or zero fill), which tells what to do when the end of the row
or column is reached.

For this report it is important to know the graphical
layout of MFL code, i.e. the fields that hold the array
transformations and data descriptors and the field that holds the
mathematical instructions of the signal processing operation. This
division buys the programmer a straightforward way of dividing up
the signal processing operation and, as this report will show, if
the hardware is built for the language, it eliminates the need to
change microcode with each new application. The microcoding then is
part of the hardware design cycle and after that it is not dealt
with. MFL can be viewed as a group of reduced instruction sets
optimized for signal processing. Each set can be efficiently
implemented directly in hardware to eliminate the need for
microcode.


A  PROGRAMMING  EXAMPLE  FROM  THE  MFL  WORKSTATION 

A complete MFL workstation, for writing and debugging MFL
code, has been implemented on a Macintosh computer. Figure 4
shows an MFL program on the workstation. The program is called
dcp_dcr.mfl, and is an adaptive filter for estimating and removing
the DC component from a waveform to form a zero-mean waveform.
The figure shows the workstation's user interface. A precomputed
estimate of the DC average, SO, is first subtracted from the input
sine wave, X. Then the new DC average is found. Since the sine wave,
X, is a vector, its average is simply the sum of its elements
divided by its number of elements (denoted by X(I3). This average is
weighted by the factor A and used to update the DC average
estimate, SO. The figure shows the output, Y, of the program after
the program has executed. Data in any of the windows can easily be
changed and the algorithm re-executed. This example is given to
show the windowed interactive environment of the workstation and a
complete MFL program.
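
The data flow of this filter can be sketched outside MFL. The Python
below is only an illustration, not the dcp_dcr.mfl program itself:
the function name dc_remove and the exact update rule are
assumptions, since the text says only that the average is "weighted
by the factor A". The rule assumed here moves SO toward the block
average by the fraction A, which is the same as
(1 - A)(SO) + (A)(average of X).

```python
def dc_remove(x, s0, a):
    """Return (y, new_s0) for one block of input samples.

    x  : input vector X
    s0 : precomputed DC estimate SO
    a  : weighting factor A (assumed to lie between 0 and 1)
    """
    y = [sample - s0 for sample in x]   # subtract the current DC estimate
    residual_mean = sum(y) / len(y)     # DC that is still left in the output
    new_s0 = s0 + a * residual_mean     # assumed update rule (see lead-in)
    return y, new_s0


y, s0 = dc_remove([3.0, 5.0, 3.0, 5.0], s0=3.5, a=0.5)
```

Calling dc_remove again with the returned estimate on the next block
of samples drives the mean of Y toward zero.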


NAWCADWAR-92100-50



Basic MFL Design

The MFL language specifications should be used to design an
efficient MFL processor. Two key points in MFL processor design
philosophy are: (1) an MFL processor is in effect a high level
machine: the processor is built to optimize the constructs of the
language, and the assembly language of this machine, MFL, is then a
high level language; (2) the MFL processor can be viewed as 4
processors running in lock step. The code that drives each of the 4
processors (3 smart ports and the arithmetic unit) is the code in
the 4 boxes in the MFL code fields on the left of figure 5. The four
MFL processors that the code in each of these fields drives are on
the right of figure 5. These four processors can be completely
independent at run time or they can be controlled by a Command
Interpreter (see the Command Interpreter section; the independence
of the four processors can vary).

Figure 5

An MFL processor is a three address machine, the three
addresses being supplied by smart ports A, B, and C, with an
Arithmetic Unit optimized to perform the class of algorithms the
processor will work on. During each cycle of an MFL processor the
two input data elements are supplied to the upper stage of the AU by
smart ports B and C, and smart port A writes the output of the lower
stage into memory. These 4 elements, the Smart Ports and the AU,
are the pipe in an MFL processor.

ARITHMETIC UNIT

The MFL Arithmetic Unit is a simplified AU. It doesn't need the
shifters, status registers, pre- and post-scalers, or rounders of
current arithmetic units. No data formatting or sequencing is done
in the arithmetic unit; it is strictly a number cruncher. A separate
and flexible addressing, sequencing and data formatting mechanism
is found in each of the Smart Ports of the MFL processor. When data
reaches the AU it is formatted and ready for processing. Therefore,
a uniform number representation for the inputs and outputs of the
AU is needed, and this function is handled by the Data Formatter
portion of the Smart Port.

A basic two stage MFL AU is shown in figure 6. This two stage
AU, with an ALU and multiplier in the upper stage and an ALU and
accumulator in the lower, fits the multiply-accumulate structure of
many DSP algorithms. There are many AU structures that fit in an
MFL design. An MFL AU is solely the number cruncher of the processor
and it should be designed to optimize the class of problems that the
processor will work on.

Figure 7 shows the mapping of the AU code of a multiply-accumulate
example to the AU of figure 6. The MFL code reads, from left to
right: multiply the inputs from smart ports B and C, then add the
product to the previous products. This multiply-accumulate form has
the structure (#)(#) + (#)(#) + ... The machine code produced by
the adaptor would select the multiply operation for the upper stage
and the add-accumulate operation for the lower stage.

Figure 8

The MFL adaptor, a simple compiler, translates each half of the
MFL function field into the machine code that drives the two stages
of the MFL AU (figure 8). The left side of the function field drives
the upper stage and the right side drives the lower stage. A single
stage, triple stage or even quadruple stage AU is possible; it
merely complicates the adaptor's mapping of the MFL function code to
the AU.
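
The split of the multiply-accumulate example across the two AU
stages can be sketched as follows. This Python is an illustration
only: the function names are hypothetical, and the lists port_b and
port_c stand in for the element streams that smart ports B and C
would deliver, one pair per cycle.

```python
def upper_stage(b, c):
    """Upper stage with the multiply operation selected."""
    return b * c


def lower_stage(product, accumulator):
    """Lower stage with the add-accumulate operation selected."""
    return accumulator + product


def multiply_accumulate(port_b, port_c):
    """Form (#)(#) + (#)(#) + ... over the two input streams."""
    acc = 0
    for b, c in zip(port_b, port_c):    # one pair of elements per cycle
        acc = lower_stage(upper_stage(b, c), acc)
    return acc                          # the value port A would write back


result = multiply_accumulate([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6
```

Neither stage does any addressing or formatting; in this sketch, as
in the text, that work belongs entirely to the ports.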
SMART PORTS

The MFL Smart Ports are as important as the AU in an MFL
processor. A successfully designed MFL processor will be based on
the effective implementation of this key element. In this section
the Smart Port will be discussed and the basic Smart Port
architecture will be covered in detail, with particular attention
paid to the address generator, which is the central feature of the
MFL Smart Port processor.

In traditional AU design, the AU must transform data items into a
form appropriate for the AU operation. The AU then switches
between modes of data formatting and arithmetic manipulation for
each data item. This switching usually requires additional control
from the processor control unit. The Smart Port removes this
overhead from the AU and from the Command Interpreter, the processor
control unit of MFL.

The Smart Port fetches individual elements from memory and
delivers them to the AU in the right form and the right order. The
Smart Port primitives, the MFL array transformation code, are a set
of data manipulation operators. These primitives are translated into
the machine code that drives the Smart Port. Once the port is
initialized, the accessing of data items goes on concurrently with
the AU processing. Thus, the Smart Port is a completely separate
processor with its own programming code, the array transformation,
that works concurrently and in lock step with the AU. The Smart Port
can access long strings of data items for processing by the AU
without intervention by the processor control unit. Before taking a
detailed look at the Smart Port's address generator architecture it
will be helpful to briefly review the MFL array transformation code,
since the architecture is designed around this part of the language.

Survey  of  Array  Transformations 
(for  complete  documentation,  see  NADC  Report  #N62269-83-C-0441 
MACRO  FUNCTION  SET  FORMALIZATION) 


The  general  form  of  an  array  transformation  is: 


DATANAME
C16  10;20
{ Δ4  Δ3  Δ2  Δ1 | Δ0 }
{ L4  L3  L2  L1 | B  }

Where DATANAME is the name of the array. The second line reads
complex 16 bit data in a 10X20 array. The deltas in the third line
are the displacements, or the directions that the reads take place
in, with delta-zero being the starting point of the first read. For
example a normal read for a matrix would look like this:

DATANAME
C16  10;20
{ j  i | 0 }
{ J  I | w }

Delta-1 is an "i", therefore the direction of the first read is the
row direction. The "L1" below the delta-1, in this example an "I",
means read to the end of the row. If the programmer only wanted a
certain number of data items from the row read, say 5, the code
would look like:

DATANAME
C16  10;20
{ j  i | 0 }
{ J  5 | w }

One could have read the data in transposed order by reading in the
column direction for the row elements. The code would then be:

DATANAME
C16  10;20
{ i  j | 0 }
{ I  J | w }

This array would then be delivered to the Arithmetic Unit column by
column instead of row by row. The delta, L sets are in effect DO
loops, the delta-1, L1 set being the inner loop and the delta-2, L2
set being the outer loop. The "w" in the lower right hand corner is
the boundary mode. In this example the boundary mode is "wrap
around". Wrap around means that when the number of reads in the
displacement direction (the "L1", "L2", etc.) exceeds the number of
elements in the row, the reading wraps around to the first element
in the row until the "L1" number of data points has been output.
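
The DO-loop reading of the delta, L pairs can be sketched in Python.
This is an illustration under stated assumptions, not MFL: the
function transform and the (di, dj) step tuples are hypothetical
stand-ins for the "i" and "j" deltas, the start point is fixed at
element 0, and only the wrap around boundary mode is modelled.

```python
def transform(matrix, delta1, delta2, len1, len2):
    """Deliver matrix elements in array-transformation order.

    delta1, delta2 : (di, dj) steps of the inner and outer loops
    len1, len2     : number of reads of the inner and outer loops
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = []
    for outer in range(len2):           # delta-2, L2: the outer DO loop
        for inner in range(len1):       # delta-1, L1: the inner DO loop
            i = (inner * delta1[0] + outer * delta2[0]) % n_cols  # wrap in row
            j = (inner * delta1[1] + outer * delta2[1]) % n_rows  # wrap in column
            out.append(matrix[j][i])
    return out


I_STEP, J_STEP = (1, 0), (0, 1)         # one step in the "i" or "j" direction
m = [[1, 2, 3],
     [4, 5, 6]]
normal = transform(m, I_STEP, J_STEP, len1=3, len2=2)      # { j i | 0 } { J I | w }
transposed = transform(m, J_STEP, I_STEP, len1=2, len2=3)  # { i j | 0 } { I J | w }
wrapped = transform(m, I_STEP, J_STEP, len1=5, len2=1)     # L1 longer than the row
```

Here normal delivers the matrix row by row, transposed delivers it
column by column, and wrapped re-reads the start of the first row
once the row end is passed.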

SMART  PORT  ARCHITECTURE 
The Smart Port supplies data to the Arithmetic Unit by executing
the addressing sequences specified by MFL's array transformation
code. Two Smart Ports, ports C and B, read data from memory to the
AU, and a third, port A, writes the AU output back into memory.
Figure 9 shows an implementation of an MFL processor (the Air
Force's AOSP).

Figure 9  AOSP's MFSP


Note  the  AU  in  the  processor  of  figure  9.  It  has  been  optimized, 
with  4  multipliers  and  two  ALU's  in  the  first  stage  and  4  ALU's  in 
the  second  stage,  for  the  class  of  problems  the  processor  will  work 
on. 

The architecture of each AOSP Smart Port is shown in figure 10.
The AOSP Smart Port includes an address generator, a data
formatter, and a memory controller. The address generator produces
addresses to access a data array stored in the memory banks. The
data formatter translates the data to a form convenient for the AU.
In the AOSP configuration, the data is packed in memory in 64 bit
words and the data sent to the AU is unpacked and left justified.

The memory controller provides control for the address generator
and the data formatter, as well as providing interfaces to the AU
(control lines) and the CI (ABUS). The memory controller initiates
and controls all memory accesses by the Smart Port.

The  key  piece  of  hardware  in  the  MFL  Smart  Port  is  the  address 
generator  and  in  the  next  section  of  the  report,  the  address 
generator  will  be  covered  in  detail.  The  array  transformation  code 
of  MFL  is  the  program  code  of  the  address  generator. 


ADDRESS  GENERATOR  ARCHITECTURE 
This section will examine the Smart Port's address generator
architecture with a few examples of a register by register mapping
of MFL code into the address generator registers. These examples
will show that, instead of microcode, the Smart Port is programmed
by simply passing parameters to its registers. This concept of
parameter passing vs. microcode is a key feature in the flexibility
and programmability of an MFL design. Thus the array transformation,
the MFL code that drives the Smart Port, sets up the registers of
the Smart Port, and the Smart Port is simply a processor that has
been hardwired to perform one function: the array transformation.

MFL  array  transformation  code  has  the  form: 


DATANAME
C16  10;20
{ Δ4  Δ3  Δ2  Δ1 | Δ0 }
{ L4  L3  L2  L1 | B  }

Because the MFL processor is built around the MFL language, the
processor can be thought of as a high level machine whose low level
program code is MFL; to the user, MFL is then a high level language.
The translation from the MFL code to the actual machine code that
drives the hardware is then a simple process, and the MFL code can
be thought of as the microcode of the MFL processor (the lowest
level, simplest mnemonic code, closest to the hardware).

In this first array transformation example, the simple, low level
relationship of MFL code to architecture can be seen. The MFL code
has a one to one relationship with the register contents and
programming of the different parts of the hardware. Figure 11 is an
overall look at the address generator architecture. On the left, in
the length section, the length registers hold the L1, L2, etc.
values and the length counter is a simple counter that increments at
each cycle. The comparison of the length counter with the length
register determines which loop the address generator is processing.
The examples will demonstrate this process in detail. The delta
registers on the right hold the delta displacements and the delta
accumulators hold the current output index. The current output index
is then used to compute the address of data to be fetched from or
deposited into memory.

The length counters and delta accumulators are like odometers that
count up to their respective limits, the L#'s, and then reset. The
level number is a 2-bit address that holds the current "level" of
the pointers. The boundary adjust checks whether the current output
index is beyond the actual row and column length and adjusts the
index according to the boundary mode (wrap around or zero fill). The
final address is calculated by the formula:

memory address = (column index)(row length) + (row index) + (base address)

The row and column length adjust is the simple test:

1)  if index < 0
        (wrap around)  add row length
        (zero fill)    set zero fill flag
2)  if index >= row length
        (wrap around)  subtract row length
        (zero fill)    set zero fill flag
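
The test and the address formula above translate directly into a
few lines of Python. This is a sketch only: the function names and
the mode codes "w" and "z" are illustrative assumptions ("w" appears
in the MFL examples; "z" for zero fill is invented here).

```python
WRAP, ZERO_FILL = "w", "z"   # "z" is a hypothetical code for zero fill


def boundary_adjust(index, length, mode):
    """Return (adjusted index, zero fill flag) for one index component."""
    if index < 0:
        if mode == WRAP:
            return index + length, False   # wrap around: add row length
        return index, True                 # zero fill: set the flag
    if index >= length:
        if mode == WRAP:
            return index - length, False   # wrap around: subtract row length
        return index, True                 # zero fill: set the flag
    return index, False                    # index already inside the row


def memory_address(column_index, row_index, row_length, base_address):
    """(column index)(row length) + (row index) + (base address)."""
    return column_index * row_length + row_index + base_address


index, zero_flag = boundary_adjust(10, 10, WRAP)   # one past the end of a row
address = memory_address(2, 3, row_length=20, base_address=100)
```

Because the test adds or subtracts the row length only once, it
corrects an index that is out of range by less than one row length,
which is the case when the adjustment is applied on every cycle.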

In wrap around mode, when the reading hits the end of a row it wraps
around to the beginning of the row; in zero fill mode, when the end
of the row is reached zeros are output until the end of the read.

The first example is a 7 point sliding window operating across a
vector. Figure 11 shows the overall architecture of the Smart Port
for this read. The MFL code of this example is:

{ 2i  i | 0 }
{ 25  7 | w }

The  code  tells  the  port  to  read  7  data  points,  slide  two  points  over, 
read  7  more,  etc.  until  25  sets  of  7  data  points  have  been  read. 


[Diagram: a vector with data elements numbered 0, 1, 2, ...; the
sliding window marks points 1 through 7 as the output of the first
read and, shifted two elements over, points 1 through 7 as the
output of the second read.]


Before the Smart Port can execute the required MFL code, the
array transformation, the code must first be decoded and loaded into
the Smart Port registers. The numbers of points to be output are
loaded into the address generator's length registers (the inner loop
is on top) and the delta displacements are loaded into the delta
registers (again the inner loop is on top). Finally, the length
counters are initialized to zero and the delta accumulators are
initialized to the starting point for the first read. The initial
state of the registers is shown in figure 12. Note the one to one
mapping of the registers to the array transformation code.

Figure 12  (address generator: length registers and counters, delta
registers and accumulators)

During the first cycle, the delta accumulator value (the first
index) is output and the length counter is compared with the first
length register. The counter is less than the length register, so
the counter is incremented and the contents of the delta register
are added to its accumulator. The result of this first cycle is
shown in figure 13.

Figure 13  (address generator: length registers and counters, delta
registers and accumulators)

In the next cycle, the current index is first output and the L1
length register is compared with its counter. The counter is less
than the contents of the length register, so the level stays at one
and the length counter is incremented by one. The contents of the
delta-1 register are added to its accumulator. This simple
incrementing and adding process goes on until the length counter
reaches 7 (7 indexes have been output). The result is shown in
figure 14.

Figure 14  (address generator: length registers and counters, delta
registers and accumulators)

In the next cycle, when the length counter is compared to the L1
register they are equal, therefore the level is incremented to level
2. The counter of level 2 is incremented and the counter at level 1
is reset. D2 is added to its accumulator, the first level's delta
accumulator is reset, and the level goes back to one for another
round on the inner loop of the array transformation. The result is
shown in figure 15.

This process continues until the outer loop, here L2, has reached
its limit and the array transformation is finished.
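
The odometer behaviour of the counters and accumulators can be
sketched in Python for the 7 point sliding window. This is a
simplified model, not the hardware: the generator name is
hypothetical, the level change is folded into the same step as the
output rather than taking its own cycle, resetting a lower
accumulator is assumed to mean reloading it from the level above,
and the boundary adjust is left out so the raw indices are visible.

```python
def address_generator(lengths, deltas, start=0):
    """Yield output indices; lengths and deltas are listed inner loop first."""
    levels = len(lengths)
    counters = [0] * levels             # length counters start at zero
    accumulators = [start] * levels     # delta accumulators start at delta-0
    while True:
        yield accumulators[0]           # the current output index
        level = 0
        counters[0] += 1
        # Carry upward while a counter has reached its length register.
        while counters[level] == lengths[level]:
            counters[level] = 0
            level += 1
            if level == levels:
                return                  # the outer loop has reached its limit
            counters[level] += 1
        accumulators[level] += deltas[level]
        for lower in range(level):      # lower levels restart from the new point
            accumulators[lower] = accumulators[level]


# { 2i  i | 0 } / { 25  7 | w }: L1 = 7 with delta-1 = i, L2 = 25 with delta-2 = 2i
indices = list(address_generator(lengths=[7, 25], deltas=[1, 2]))
```

The first seven indices are 0 through 6, the next seven are 2
through 8, and the generator stops after 25 windows of 7 indices.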

Figure 15  (address generator: length registers and counters, delta
registers and accumulators)

The next example will be a sliding window on a matrix. The
output will be a block of data. Consider the MFL code:

{ j  2i  i | 0 }
{ 3  25  7 | w }


Graphically  the  read  looks  like: 

[Diagram: the window slides along the i (row) direction and steps
down in the j (column) direction.]


The  array  being  read  is  a  two  dimensional  array  (  i  X  j).  The 
array  being  output  is  in  three  dimensions.  Thus  each  row  of  the 
array  being  read  is  read  as  in  the  first  example,  forming  a  25X7 
array,  and  this  read  is  done  three  times  moving  down  by  one  in  the 
column  direction.  The  output  passed  to  the  AU  is  a  block  of  data 
3X25X7. 
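
The shape of this output block can be checked with a short Python
sketch. It is an illustration only: the function name and its
default arguments are assumptions taken from the example code above,
and the boundary adjust is omitted, so indices past the end of a row
are not wrapped.

```python
def matrix_sliding_window(n_rows=3, n_windows=25, window=7, slide=2):
    """Return block[j][k][n] = (i index, j index) of each element read."""
    return [[[(k * slide + n, j)            # delta-1 = i inside each window
              for n in range(window)]       # L1 = 7 points per window
             for k in range(n_windows)]     # L2 = 25 windows, delta-2 = 2i
            for j in range(n_rows)]         # L3 = 3 rows, delta-3 = j


block = matrix_sliding_window()
shape = (len(block), len(block[0]), len(block[0][0]))
```

The nesting mirrors the three levels of the array transformation:
the innermost comprehension is the L1 loop and the outermost is the
L3 loop.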

More registers and counters are needed to do this example but the
process of loading and cycling through the array transformation is
basically the same. Delta registers are needed for each read in the
"i" direction. Here, referring back to the general form of the array
transformation, there is a read in the "i" or row direction in the
L1 and L2 slots and in the "j" or column direction in the L3 slot.
After the registers are initialized the result is shown in figure
16.

The next figure shows the result of the first cycle. As in the
first example, the length counter has been incremented and the
deltas have been added to their accumulators. The result is shown in
figure 17.

Figure 17


And again after another cycle:

Figure 18

The cycles continue at this level until the L1 register equals its
counter register, as shown in figure 19:

Figure 19

Then the level moves up by one, the second level length counter is
incremented and the second level deltas are added to their
accumulators. The result is shown in figure 20:

Figure 20

And the level drops back to the first:

Figure 21

And the inner loop is repeated. This goes on until the L2 counter is
equal to the L2 length register, as shown in figure 22:

Figure 22  (i component and j component: length register, length
counter, delta registers, delta accumulators)

Then the pointer moves up one to the third level, the length counter
at that level is incremented and all the length counters below it
are cleared. The result is shown in the next figure.

Figure 23  (i component and j component: length register, length
counter, delta registers, delta accumulators, boundary adjust,
calculate address, memory address)

And the level drops back down to the first level, doing the inner
loops again until the L3 counter reaches 3.

Figure 24  (i component, j component)

In conclusion, smart memory operations can be summarized:

INTELLIGENT MEMORY OPERATIONS

TYPE:                                OPERATION:

- Address Generation                 - Array Transformation

- Element Selection and Alignment    - Unpack/Pack Data
                                     - Select REAL or IM part
                                     - Unsigned-Signed Conversion
                                     - Set Output to Zero/One
                                     - Left/Right Shift


SPEED  ISSUES 

A key technology question in the Smart Port processor is
whether or not it can keep up with the AU. AUs generally run at high
clock rates, and the Smart Port of an MFL processor is now part of
the pipe and must feed the AU its data, formatted and in the right
order, in lock step with the AU. The AU's pipeline is in effect
extended to include the Smart Port: in each clock cycle, the Smart
Ports must generate 3 addresses, access and deliver 2 data points,
and store 1. MFL follows the current trend of RISC instruction sets.
As memory technologies have reduced access times, the RISC trend has
been to perform more memory accesses, with simpler instructions, per
operation. In the older complex instruction set architectures,
because of the slower memory access speeds, the central processor
tried to do as much processing as possible on each memory access.
Now that memory technologies, and therefore memory accesses, have
caught up to AU speeds (ECL at 5 ns, GaAs at 1 ns and static RAMs
with 35 ns access times), the older complex processing instructions
have been broken down into smaller and simpler instruction sets, so
that there are more processing cycles and memory accesses per
operation and they run at higher speeds. Hardware hardwired to
perform complex instructions paid a speed penalty, but with simpler
instructions and faster memories the RISC philosophy has increased
processing speeds.

The Smart Port architecture reviewed in the previous section shows
the amount of processing required for each Smart Port operation. This
processing, added to the access time of the memory technology used,
increases the overall access time of the Smart Port. To obtain the
fastest overall access time, a pipelined architecture is suggested.
The access time of the memory chips themselves then becomes the
limiting factor in the speed of the Smart Port, and the processing
required for each access is done in the pipeline and is transparent to
the actual access. An increase in speed beyond the access time of the
memory chips would require interleaving between memory banks. In MFL
this adds a very difficult complication because of the array
transformations. Array transformations allow the memory to be read in
a completely arbitrary, programmable way within a single memory, but
in an interleaving strategy reads must be interleaved across memory
banks for full throughput. This requires that the data be stored in a
prearranged order for the array transformation to work properly. Thus
the limiting factor in a straightforward MFL Smart Port design is
simply a memory technology with an access time that matches the cycle
time of the Arithmetic Unit.
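
The interleaving difficulty can be illustrated with a small sketch. It
assumes a hypothetical low-order interleaving scheme (bank = address
mod number of banks) and a bank count of four; neither the scheme nor
the function name comes from the MFL design. It simply reports which
banks a strided read sequence touches.

```python
def banks_touched(start, stride, count, num_banks):
    """Return the set of banks hit by `count` reads at a fixed stride.

    Assumes low-order interleaving: bank = address % num_banks.
    """
    return {(start + i * stride) % num_banks for i in range(count)}


NUM_BANKS = 4  # hypothetical bank count, chosen only for illustration

# Sequential read: consecutive addresses rotate through all four banks.
sequential = banks_touched(start=0, stride=1, count=16, num_banks=NUM_BANKS)

# Column read of a row-major array with 4 elements per row (stride 4):
# every address falls in the same bank, so the banks cannot overlap.
column = banks_touched(start=0, stride=4, count=16, num_banks=NUM_BANKS)

print(sorted(sequential), sorted(column))
```

Under these assumptions the sequential read spreads over every bank,
while the transformed (column) read lands in a single bank, which is
why the data would have to be stored in a prearranged order.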

In conclusion, the MFL Smart Port is a processor hardwired to
perform array transformations. It is programmed by passing parameters
to its registers. Because of the amount of processing to be done per
cycle in the Smart Port, a pipelined processor for the Smart Port is
suggested.


Command Interpreter

The Command Interpreter (CI) coordinates the processing of the
three Smart Ports and the Arithmetic Unit. The role of the Command
Interpreter at run time varies depending on how independent the four
MFL processors (the AU and the three Smart Ports) are. The CI is
responsible for setting up the Smart Ports and the AU, program
sequencing, address modification (looping), and data fetch requests.
It obtains the required description of data arguments through their
data descriptors and sets up the appropriate pipeline operations
required to execute the instructions.

The Command Interpreter function can be performed at run time or
at compile time. In the run time mode the four processors of the MFL
processor operate as slaves to the Command Interpreter. In general the
CI does not involve itself with the detailed control of the AU and
Smart Ports after set up in this mode. Relieving the CI of this burden
lets it anticipate the next instruction and begin the next set up.

The  two  main  functions  of  the  Command  Interpreter  in  this  run 
time  mode  are: 

Interpreting MFL commands from the four code fields, translating
these commands into the appropriate machine code, and delivering that
code to the specified processors (the AU or Smart Ports).


The Command Interpreter, then, is a program interpreter and
intelligent program counter. It puts the MFL code into hardware and
then acts as a program sequencer during run time, controlling program
flow and looping operations.
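
A minimal sketch of this run time role follows. The class names, the
field strings, and the `translate` placeholder are all hypothetical;
the sketch only shows the shape of the loop: read one line of four
code fields, translate each field, deliver it to its processor, and
advance the program counter.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Stand-in for the AU or one Smart Port; it only records loaded code."""
    name: str
    loaded: list = field(default_factory=list)


@dataclass
class MflLine:
    """One line of MFL: three Smart Port fields and the arithmetic field."""
    a: str
    b: str
    c: str
    f: str


class CommandInterpreter:
    def __init__(self):
        self.processors = {
            "a": Processor("Smart Port A"),
            "b": Processor("Smart Port B"),
            "c": Processor("Smart Port C"),
            "f": Processor("Arithmetic Unit"),
        }
        self.pc = 0  # program counter, advanced by the sequencer

    def translate(self, field_code):
        # Placeholder: real translation to machine code is hardware specific.
        return ("machine-code", field_code)

    def run(self, program):
        while self.pc < len(program):
            line = program[self.pc]
            # Deliver each translated field to the processor it drives.
            for key, proc in self.processors.items():
                proc.loaded.append(self.translate(getattr(line, key)))
            self.pc += 1


# Hypothetical two-line program; the field contents are placeholders.
program = [MflLine(a="A0", b="B0", c="C0", f="f0"),
           MflLine(a="A1", b="B1", c="C1", f="f1")]
ci = CommandInterpreter()
ci.run(program)
```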

On the other hand, the role of the Command Interpreter during run
time could be eliminated by an MFL compiler that handles the Command
Interpreter's function at compile time. The sequencing of the four
processors would be coordinated, and the opcodes and data would be
downloaded to the four processors. At run time the processor would
then have no outside involvement: the four MFL processors would be
four completely independent processors running in lock step. This
approach helps with one of the potential bottlenecks in an MFL
processor: the set up time to pass parameters to the Smart Ports. In
this mode each MFL processor would have its own microstore to hold its
run time code, as shown in the next figure.

In summary, the Command Interpreter coordinates the processing of
the four independent MFL processors (the Arithmetic Unit and the three
Smart Ports), and it can perform this program sequencing at run time
or at compile time.


CONCLUSION

In conclusion, the MFL processor can be thought of as four
independent processors working in lock step. The code in the four MFL
code fields drives the different MFL processors. MFL has a low
instruction set up overhead because of this division of tasks among
the four MFL processors: the 3 Smart Ports and the Arithmetic Unit.
MFL also has a mathematically based, complete instruction set with a
unique array transformation capability.

In this factored parallel operation of an MFL processor, each cycle
must: generate three addresses (SP), access 2 data points and store 1
(SP), perform 2 arithmetic operations (AU), and perform the Command
Interpreter's functions if the Command Interpreter is operating in the
interpreter mode.
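
The per-cycle workload can be sketched as a simple loop. The example
below is an illustration under stated assumptions, not the MFL
hardware: it uses a running multiply-accumulate as the two arithmetic
operations, plain sequential addressing for all three ports, and a
hypothetical function name. Real hardware would overlap these steps in
pipeline stages rather than execute them one after another.

```python
def run_lock_step(mem_b, mem_c, n):
    """Simulate n cycles: 3 addresses, 2 reads, 2 arithmetic ops, 1 store."""
    mem_a = [0] * n
    acc = 0
    counts = {"addresses": 0, "reads": 0, "arith_ops": 0, "stores": 0}
    for cycle in range(n):
        # Smart Ports B, C and A each generate one address.
        addr_b, addr_c, addr_a = cycle, cycle, cycle
        counts["addresses"] += 3
        # Smart Ports B and C each deliver one data point to the AU.
        x, y = mem_b[addr_b], mem_c[addr_c]
        counts["reads"] += 2
        # AU performs two operations: a multiply and an add.
        acc = acc + x * y
        counts["arith_ops"] += 2
        # Smart Port A stores one result.
        mem_a[addr_a] = acc
        counts["stores"] += 1
    return mem_a, counts


result, counts = run_lock_step([1, 2, 3], [4, 5, 6], 3)
```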

MFL has the unique capability of high level to low level code
efficiency when the hardware has been built around the language. The
breakdown of this code from the macro level to machine code is shown
in the next figure.

[Figure: breakdown of code from programs, macros, and primitives
("transpose B times A"), through MFL code (C {Oj i}, B {jO i},
+/ : *, A), to machine code for the AU and Smart Port]

Figure 1

Macros can easily be built from the lower constructs of the
language, so the engineer can program at a very high level, invoking
only commands such as FFT or BANDSHIFT and supplying the proper
inputs. The engineer familiar with the signal processing math can also
go below the macro level and write his own customized MFL code. MFL is
very different from microcode in this respect. The progression from
the highest level to the most primitive MFL subset level, as shown in
figure 1, is structured and orderly, and the code is much easier for a
programmer to read than microcode. Microcode is also very hard to
reread: once the programmer has left the program for any length of
time, the code quickly becomes unintelligible. MFL is a bit confusing
at first glance, but once the language is learned the code becomes
very readable because of its close ties to the math of the algorithm.
Code left for any length of time is easy to pick up and read,
particularly when programming with macros, an attribute that most HOLs
strive for.

MFL  is  a  complete  set  of  signal  processing  primitive  mathematical 
and  control  functions.  This  set  can  be  trimmed  and  the  hardware  then 
optimized  to  suit  a  particular  class  of  problems  that  the  processor  will
perform.  Thus  MFL  can  be  thought  of  as  a  hardware  design  philosophy 
that  will  deliver  a  signal  processor  whose  lowest  level  of  code  will  be 
programmed  in  MFL. 

The  next  section  of  this  report  is  a  brief  overview  of  the  salient 
features  of  the  MFL  language  and  the  MFL  processor. 


Synopsis of MFL

A line of Macro Function Language (MFL) code is written in the
four field template shown in figure 2.


Graphic Format:

    B, C   Smart Ports READING
    A      Smart Port WRITING
    f      Arithmetic Functions

Figure 2

The code within each of these four fields drives a different
processor within the MFL processor. In figure 3, the code within the
fields marked A, B, C and "f" on the left of the figure drives the
four processors in the generic MFL machine on the right.


[Figure 3: MFL graphical code fields (left) and the generic MFL
machine (right)]

Smart Port code (array transformations) has the form:


    DATANAME  C16  10;20

    { A4  A3  A2  A1 | A0 }
    { L4  L3  L2  L1 | B  }

Figure 4

The Smart Port is a special purpose processor hardwired to perform the
array transformation. Instead of being microcoded, the Smart Port is
programmed by passing the parameters shown in figure 4 to registers in
the Smart Port. The MFL Smart Port, with the variables from figure 3
for a two dimensional read, is shown with its registers initialized in
figure 5.

[Figure: Smart Port block diagram; recoverable labels are ADDRESS
GENERATOR, DELTA, and Memory Address]

Figure 5
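
The idea of programming an address generator with parameters rather
than microcode can be sketched as follows. This is an illustration
only: it assumes that each dimension has an address increment (a
delta) and a length, plus a base and a starting offset, which is one
plausible reading of the A and L parameters in figure 4. The function
name and argument names are hypothetical. Hardware would add a delta
each cycle rather than multiply, but the address sequence is the same.

```python
from itertools import product


def generate_addresses(base, start, deltas, lengths):
    """Yield memory addresses for a multi-dimensional strided read.

    `deltas` and `lengths` are ordered from outermost to innermost
    dimension; the innermost index varies fastest.
    """
    for index in product(*(range(n) for n in lengths)):
        yield base + start + sum(i * d for i, d in zip(index, deltas))


# A 3-row by 4-column array stored row by row at addresses 0..11.
row_order = list(generate_addresses(0, 0, deltas=(4, 1), lengths=(3, 4)))

# The same memory read column by column (a transpose), obtained only
# by changing the parameters passed to the generator.
transposed = list(generate_addresses(0, 0, deltas=(1, 4), lengths=(4, 3)))
```

Changing only the parameter values turns a row-order read into a
transposed read, which is the sense in which the port is programmed
through its registers.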

The MFL Arithmetic Unit (AU) is a simplified AU. The data passed
to the AU is formatted and in the right order. The AU, then, is simply
the number cruncher of the MFL processor and should be designed to
optimize the class of problems that it will work on. The basic
primitive MFL AU code is shown in the following figure.

[Figure: Minimum Function Box Code Primitives]

Minimum Function Box Control Operators

    A f B        Single Stream
    A B f C      Corresponding Elements
    A f / B      Reduction

    B f2"f1 C
    ’f C
    f2 : f1
    combine functions
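
The control operators named above can be sketched in Python as a rough
analogy. The function names are hypothetical, and the reading of
"f2 : f1" as "apply f1, then apply f2 to the result" is an assumption
based on the "+/ : *" entry in figure 1, not a definition taken from
the MFL specification.

```python
from functools import reduce
import operator


def single_stream(f, b):
    """Apply f to each element of one input stream."""
    return [f(x) for x in b]


def corresponding_elements(f, b, c):
    """Apply f to corresponding elements of two input streams."""
    return [f(x, y) for x, y in zip(b, c)]


def reduction(f, b):
    """Fold one stream down to a single value with f."""
    return reduce(f, b)


def combine(f2, f1):
    """Assumed reading of `f2 : f1`: run f1, then feed its result to f2."""
    def combined(*streams):
        return f2(f1(*streams))
    return combined


def sum_reduce(b):
    return reduction(operator.add, b)


def elementwise_multiply(b, c):
    return corresponding_elements(operator.mul, b, c)


# Under the assumed reading, "+/ : *" is a sum of products.
inner_product = combine(sum_reduce, elementwise_multiply)
```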